A Partitioned Similarity Search with Cache-Conscious Data Traversal

Authors

  • Xun Tang
  • Maha Alabduljalil
  • Xin Jin
  • Tao Yang
Abstract

All-pairs similarity search (APSS) is used in many web search and data mining applications. Previous work has used techniques such as comparison filtering, inverted indexing, and parallel accumulation of partial results. However, shuffling intermediate results can incur significant communication overhead as data scales up. This paper studies a scalable two-phase approach called Partition-based Similarity Search (PSS). The first phase partitions the data and groups vectors that are potentially similar. The second phase runs a set of tasks, each of which compares one partition of vectors with other candidate partitions. Because the data is sparse and memory is hierarchical, accessing feature vectors during the partition comparison phase incurs significant overhead. This paper introduces a cache-conscious design for data layout and traversal that reduces access time through size-controlled data splitting and vector coalescing, and it provides an analysis to guide the choice of optimization parameters. The evaluation results show that, for the tested datasets, the proposed approach eliminates unnecessary I/O and data communication early while sustaining parallel efficiency, yielding a performance improvement of one order of magnitude. It can also be integrated with LSH for approximate APSS.
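The two-phase structure described in the abstract can be illustrated with a small single-machine sketch. It is not the authors' implementation: the grouping heuristic in `partition` (sorting by dominant feature), the per-feature upper bound used to skip partition pairs, and all function names are assumptions made for illustration. The sketch also assumes non-empty vectors with non-negative weights and cosine similarity, and it does not model the cache-conscious splitting and coalescing that the paper studies.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def normalize(vec):
    """Scale a sparse vector (dict: feature -> non-negative weight) to unit length."""
    norm = sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()}


def dot(a, b):
    """Dot product of two sparse vectors, iterating over the shorter one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(f, 0.0) for f, w in a.items())


def partition(vectors, size):
    """Phase 1 (assumed heuristic): order ids by dominant feature, cut into chunks."""
    ids = sorted(vectors, key=lambda i: max(vectors[i], key=vectors[i].get))
    return [ids[k:k + size] for k in range(0, len(ids), size)]


def max_weights(part, vectors):
    """Per-feature maximum weight inside one partition."""
    bound = defaultdict(float)
    for i in part:
        for f, w in vectors[i].items():
            bound[f] = max(bound[f], w)
    return dict(bound)


def pss(vectors, tau, size=2):
    """Return all id pairs (i, j), i < j, whose cosine similarity is >= tau."""
    vectors = {i: normalize(v) for i, v in vectors.items()}
    parts = partition(vectors, size)
    bounds = [max_weights(p, vectors) for p in parts]
    result = set()
    # Phase 2: one "task" per partition; compare it with itself and with
    # later partitions that survive the upper-bound filter.
    for a in range(len(parts)):
        for i, j in combinations(parts[a], 2):
            if dot(vectors[i], vectors[j]) >= tau:
                result.add((min(i, j), max(i, j)))
        for b in range(a + 1, len(parts)):
            # With non-negative weights, dot(bounds[a], bounds[b]) is an upper
            # bound on the similarity of any cross pair, so the whole
            # partition pair can be skipped when it falls below tau.
            if dot(bounds[a], bounds[b]) < tau:
                continue
            for i in parts[a]:
                for j in parts[b]:
                    if dot(vectors[i], vectors[j]) >= tau:
                        result.add((min(i, j), max(i, j)))
    return result


docs = {
    0: {"a": 1, "b": 1},
    1: {"a": 1, "b": 2},
    2: {"c": 1},
    3: {"c": 2, "d": 1},
    4: {"a": 1, "d": 1},
}
print(sorted(pss(docs, tau=0.8)))
```

Because the filter only discards partition pairs whose upper bound is below the threshold, the sketch returns the same pairs as an exhaustive comparison; the saving comes from the partition pairs that are never loaded or compared.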


Similar resources

CC-GiST: Cache Conscious-Generalized Search Tree for Supporting Various Fast Intelligent Applications

As technology advances, the speed gap between the CPU and main memory grows every year. Because of this gap, making the most of the cache that sits between the CPU and main memory is considered important, and much research has addressed the issue. Among this work is research on cache-conscious trees for reducing the cost of accessing main memory ...


The Dense Skip Tree: A Cache-Conscious Randomized Data Structure

We introduce the dense skip tree, a novel cache-conscious randomized data structure. Algorithms for search, insertion, and deletion are presented, and they are shown to have expected cost O(log n). The dense skip tree obeys the same asymptotic properties as the skip list and the skip tree. A series of properties on the dense skip tree is proven, in order to show the probabilistic organization of...


Streaming Similarity Search over One Billion Tweets Using Parallel Locality-Sensitive Hashing

Finding nearest neighbors has become an important operation on databases, with applications to text search, multimedia indexing, and many other areas. One popular algorithm for similarity search, especially for high-dimensional data (where spatial indexes like k-d trees do not perform well), is Locality Sensitive Hashing (LSH), an approximation algorithm for finding similar objects. In this paper,...


Effect of Node Size on the Performance of Cache-Conscious Indices

In main-memory environments, the number of processor cache misses has a critical impact on the performance of the system. Cache-conscious indices are designed to improve the performance of main-memory indices by reducing the number of processor cache misses that are incurred during a search operation. Conventional wisdom suggests that the index’s node size should be equal to the cache line size ...


Index Search Algorithms for Databases and Modern CPUs

Over the years, many different indexing techniques and search algorithms have been proposed, including CSS-trees, CSB+-trees, k-ary binary search, and fast architecture sensitive tree search. There have also been papers on how best to set the many different parameters of these index structures, such as the node size of CSB+-trees. These indices have been proposed because CPU speeds have been in...


Publication date: 2016